Abstract

We extend the SKIP-GRAM model of Mikolov et al. (2013a) by taking visual information into account. Like SKIP-GRAM, our multimodal models (MMSKIP-GRAM) build vector-based word representations by learning to predict linguistic contexts in text corpora. However, for a restricted set of words, the models are also exposed to visual representations of the objects they denote (extracted from natural images), and must predict linguistic and visual features jointly. The MMSKIP-GRAM models achieve good performance on a variety of semantic benchmarks. Moreover, since they propagate visual information to all words, we use them to improve image labeling and retrieval in the zero-shot setup, where the test concepts are never seen during model training. Finally, the MMSKIP-GRAM models discover intriguing visual properties of abstract words, paving the way to realistic implementations of embodied theories of meaning.


1 Introduction 


Distributional semantic models (DSMs) derive vector-based representations of meaning from patterns of word co-occurrence in corpora. DSMs have been very effectively applied to a variety of semantic tasks (Clark, 2015; Mikolov et al., 2013b; Turney and Pantel, 2010). However, compared to human semantic knowledge, these purely textual models, just like traditional symbolic AI systems (Harnad, 1990; Searle, 1984), are severely impoverished, suffering from a lack of grounding in extra-linguistic modalities (Glenberg and Robertson, 2000). This observation has led to the development of multimodal distributional semantic models (MDSMs) (Bruni et al., 2014; Feng and Lapata, 2010; Silberer and Lapata, 2014), which enrich linguistic vectors with perceptual information, most often in the form of visual features automatically induced from image collections.

MDSMs outperform state-of-the-art text-based approaches, not only in tasks that directly require access to visual knowledge (Bruni et al., 2012), but also on general semantic benchmarks (Bruni et al., 2014; Silberer and Lapata, 2014). However, current MDSMs still have a number of drawbacks. First, they are generally constructed by first separately building linguistic and visual representations of the same concepts, and then merging them. This is obviously very different from how humans learn about concepts, by hearing words in a situated perceptual context. Second, MDSMs assume that both linguistic and visual information is available for all words, with no generalization of knowledge across modalities. Third, because of this latter assumption of full linguistic and visual coverage, current MDSMs, paradoxically, cannot be applied to computer vision tasks such as image labeling or retrieval, since they do not generalize to images or words beyond their training set.

We introduce the multimodal skip-gram models, two new MDSMs that address all the issues above. The models build upon the very effective skip-gram approach of Mikolov et al. (2013a), which constructs vector representations by learning, incrementally, to predict the linguistic contexts in which target words occur in a corpus. In our extension, for a subset of the target words, relevant visual evidence from natural images is presented together with the corpus contexts (just like humans hear words accompanied by concurrent perceptual stimuli). The model must learn to predict these visual representations jointly with the linguistic features. The joint objective encourages the propagation of visual information to representations of words for which no direct visual evidence was available in training. The resulting multimodally-enhanced vectors achieve remarkably good performance both on traditional semantic benchmarks, and in their new application to the “zero-shot” image labeling and retrieval scenario. Very interestingly, indirect visual evidence also affects the representation of abstract words, paving the way to ground-breaking cognitive studies and novel applications in computer vision.


2 Related Work 


There is by now a large literature on multimodal distributional semantic models. We focus here on a few representative systems. Bruni et al. (2014) propose a straightforward approach to MDSM induction, where text- and image-based vectors for the same words are constructed independently, and then “mixed” by applying the Singular Value Decomposition to their concatenation. An empirically superior model has been proposed by Silberer and Lapata (2014), who use more advanced visual representations relying on images annotated with high-level “visual attributes”, and a multimodal fusion strategy based on stacked autoencoders. Kiela and Bottou (2014) adopt instead a simple concatenation strategy, but obtain empirical improvements by using state-of-the-art convolutional neural networks to extract visual features, and the skip-gram model for text. These and related systems take a two-stage approach to derive multimodal spaces (unimodal induction followed by fusion), and they are only tested on concepts for which both textual and visual labeled training data are available (the pioneering model of Feng and Lapata (2010) did learn from text and images jointly using Topic Models, but was shown to be empirically weak by Bruni et al. (2014)).


Howell et al. (2005) propose an incremental multimodal model based on simple recurrent networks (Elman, 1990), focusing on grounding propagation from early-acquired concrete words to a larger vocabulary. However, they use subject-generated features as a surrogate for realistic perceptual information, and only test the model in small-scale simulations of word learning. Hill and Korhonen (2014), whose evaluation focuses on how perceptual information affects different word classes more or less effectively, similarly to Howell et al., integrate perceptual information in the form of subject-generated features and text from image annotations into a skip-gram model. They inject perceptual information by merging words expressing perceptual features with corpus contexts, which amounts to linguistic-context re-weighting, thus making it impossible to separate linguistic and perceptual aspects of the induced representation, and to extend the model with non-linguistic features. We use instead authentic image analysis as a proxy for perceptual information, and we design a robust way to incorporate it, easily extensible to other signals, such as feature norm or brain signal vectors (Fyshe et al., 2014).

The recent work on so-called zero-shot learning to address the annotation bottleneck in image labeling (Frome et al., 2013; Lazaridou et al., 2014; Socher et al., 2013) looks at image- and text-based vectors from a different perspective. Instead of combining visual and linguistic information in a common space, it aims at learning a mapping from image- to text-based vectors. The mapping, induced from annotated data, is then used to project images of objects that were not seen during training onto linguistic space, in order to retrieve the nearest word vectors as labels. Multimodal word vectors should be better suited than purely text-based vectors for the task, as their similarity structure should be closer to that of images. However, traditional MDSMs cannot be used in this setting, because they do not cover words for which no manually annotated training images are available, thus defeating the generalizing purpose of zero-shot learning. We will show below that our multimodal vectors, which are not hampered by this restriction, do indeed bring a significant improvement over purely text-based linguistic representations in the zero-shot setup.

Multimodal language-vision spaces have also been developed with the goal of improving caption generation/retrieval and caption-based image retrieval (Karpathy et al., 2014; Kiros et al., 2014; Mao et al., 2014; Socher et al., 2014). These methods rely on necessarily limited collections of captioned images as sources of multimodal evidence, whereas we automatically enrich a very large corpus with images to induce general-purpose multimodal word representations, which could be used as input embeddings in systems specifically tuned to caption processing. Thus, our work is complementary to this line of research.


3 Multimodal Skip-gram Architecture 

3.1 Skip-gram Model 


We start by reviewing the standard SKIP-GRAM model of Mikolov et al. (2013a), in the version we use. Given a text corpus, SKIP-GRAM aims at inducing word representations that are good at predicting the context words surrounding a target word. Mathematically, it maximizes the objective function:

$$\frac{1}{T}\sum_{t=1}^{T}\Biggl(\sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t)\Biggr) \qquad (1)$$

where $w_1, w_2, \ldots, w_T$ are words in the training corpus and $c$ is the size of the window around target $w_t$, determining the set of context words to be predicted by the induced representation of $w_t$. Following Mikolov et al., we implement a subsampling option randomly discarding context words as an inverse function of their frequency, controlled by hyperparameter $t$. The probability $p(w_{t+j} \mid w_t)$, the core part of the objective in Equation 1, is given by softmax:

$$p(w_{t+j} \mid w_t) = \frac{e^{{u'_{w_{t+j}}}^{\top} u_{w_t}}}{\sum_{w'=1}^{W} e^{{u'_{w'}}^{\top} u_{w_t}}} \qquad (2)$$

where $u'_w$ and $u_w$ are the context and target vector representations of word $w$ respectively, and $W$ is the size of the vocabulary. Due to the normalization term, Equation 2 requires $O(|W|)$ time complexity. A considerable speedup to $O(\log |W|)$ is achieved by using the hierarchical version of Equation 2 (Morin and Bengio, 2005), adopted here.
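As an illustration only, the full-softmax form of Equations 1 and 2 can be sketched in a few lines of plain Python. This is not the training code used in the paper (which relies on the hierarchical softmax); the function names, the list-of-lists vector layout and the use of integer word indices are assumptions made for the example.

```python
import math

def softmax_context_prob(target_vec, context_vecs, context_index):
    """Full-softmax p(context | target), in the spirit of Equation 2.

    target_vec: target vector of the word w_t (list of floats).
    context_vecs: one context vector per vocabulary word.
    context_index: vocabulary index of the context word w_{t+j}.
    """
    scores = [sum(c * t for c, t in zip(ctx, target_vec)) for ctx in context_vecs]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    return exps[context_index] / sum(exps)

def skipgram_objective(corpus, target_vecs, context_vecs, c):
    """Average log-probability of Equation 1 over a corpus of word indices."""
    total = 0.0
    for t, w_t in enumerate(corpus):
        for j in range(-c, c + 1):
            # skip the target itself and positions outside the corpus
            if j == 0 or not (0 <= t + j < len(corpus)):
                continue
            p = softmax_context_prob(target_vecs[w_t], context_vecs, corpus[t + j])
            total += math.log(p)
    return total / len(corpus)
```

The cost of `softmax_context_prob` grows linearly with the vocabulary size, which is the reason the hierarchical variant is preferred in practice.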


3.2 Injecting visual knowledge 

We now assume that word learning takes place in a situated context, in which, for a subset of the target words, the corpus contexts are accompanied by a visual representation of the concepts they denote (just like in a conversation, where a linguistic utterance will often be produced in a visual scene including some of the word referents). The visual representation is also encoded in a vector (we describe in Section 4 below how we construct it). We thus make the skip-gram “multimodal” by adding a second, visual term to the original linguistic objective, that is, we extend Equation 1 as follows:

$$\frac{1}{T}\sum_{t=1}^{T}\bigl(\mathcal{L}_{ling}(w_t) + \mathcal{L}_{vision}(w_t)\bigr) \qquad (3)$$

where $\mathcal{L}_{ling}(w_t)$ is the text-based skip-gram objective (the inner sum of Equation 1), whereas the $\mathcal{L}_{vision}(w_t)$ term forces word representations to take visual information into account. Note that if a word $w_t$ is not associated to visual information, as is systematically the case, e.g., for determiners and non-imageable nouns, but also more generally for any word for which no visual data are available, $\mathcal{L}_{vision}(w_t)$ is set to 0.

[Figure 1 (image not reproduced): example sentence "the cute little ... sat on the mat" with target word CAT.]

Figure 1: “Cartoon” of MMSKIP-GRAM-B. Linguistic context vectors are actually associated to classes of words in a tree, not single words. SKIP-GRAM is obtained by ignoring the visual objective, MMSKIP-GRAM-A by fixing $M^{u \to v}$ to the identity matrix.

We now propose two variants of the visual objective, resulting in two distinct multimodal versions of the skip-gram model.

3.3 Multi-modal Skip-gram Model A 

One way to force word embeddings to take visual representations into account is to try to directly increase the similarity (expressed, for example, by the cosine) between linguistic and visual representations, thus aligning the dimensions of the linguistic vector with those of the visual one (recall that we are inducing the first, while the second is fixed), and making the linguistic representation of a concept “move” closer to its visual representation. We maximize similarity through a max-margin framework commonly used in models connecting language and vision (Weston et al., 2010; Frome et al., 2013). More precisely, we formulate the visual objective $\mathcal{L}_{vision}(w_t)$ as:

$$-\sum_{w' \sim P_n(w_t)} \max\bigl(0,\ \gamma - \cos(u_{w_t}, v_{w_t}) + \cos(u_{w_t}, v_{w'})\bigr) \qquad (4)$$

where the minus sign turns a loss into a cost, $\gamma$ is the margin, $u_{w_t}$ is the target multimodally-enhanced word representation we aim to learn, $v_{w_t}$ is the corresponding visual vector (fixed in advance) and $v_{w'}$ ranges over visual representations of words (featured in our image dictionary) randomly sampled from distribution $P_n(w_t)$. These random visual representations act as “negative” samples, encouraging $u_{w_t}$ to be more similar to its own visual representation than to that of other words. The sampling distribution is currently set to uniform, and the number of negative samples is controlled by hyperparameter $k$.
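The max-margin term of Equation 4 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names are hypothetical, and two details are assumptions that the text does not specify, namely that the target word itself is excluded from the negative samples and that negatives are drawn without replacement.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def visual_objective_a(word_vec, own_visual, negative_visuals, margin):
    """Negated sum of hinge terms, one per negative sample (cf. Equation 4)."""
    pos = cosine(word_vec, own_visual)
    hinge = sum(max(0.0, margin - pos + cosine(word_vec, neg))
                for neg in negative_visuals)
    return -hinge

def sample_negatives(image_dictionary, target_word, k, rng):
    """Uniformly draw k visual vectors of words other than the target.

    Assumption: the target is excluded and sampling is without replacement,
    so k must not exceed the number of remaining words.
    """
    candidates = [w for w in image_dictionary if w != target_word]
    return [image_dictionary[w] for w in rng.sample(candidates, k)]
```

The objective is 0 (its maximum) once the word vector is closer to its own visual vector than to every sampled negative by at least the margin, so already well-aligned words stop contributing updates.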


3.4 Multi-modal Skip-gram Model B 

The visual objective in MMSKIP-GRAM-A has the drawback of assuming a direct comparison of linguistic and visual representations, constraining them to be of equal size. MMSKIP-GRAM-B lifts this constraint by including an extra layer mediating between linguistic and visual representations (see Figure 1 for a sketch of MMSKIP-GRAM-B). Learning this layer is equivalent to estimating a cross-modal mapping matrix from linguistic onto visual representations, jointly induced with linguistic word embeddings. The extension is straightforwardly implemented by substituting, into Equation 4, the word representation $u_{w_t}$ with $z_{w_t} = M^{u \to v} u_{w_t}$, where $M^{u \to v}$ is the cross-modal mapping matrix to be induced. To avoid overfitting, we also add an L2 regularization term for $M^{u \to v}$ to the overall objective (Equation 3), with its relative importance controlled by hyperparameter $\lambda$.
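A minimal sketch of the Model B variant follows: the word vector is first projected through the cross-modal mapping matrix, and the max-margin term is then computed on the projected vector. The function names are hypothetical, and the exact form of the regularizer (here the squared Frobenius norm scaled by the regularization weight, to be subtracted from the maximized objective) is an assumption, since the text only states that an L2 term is added.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mat_vec(matrix, vec):
    """Multiply a (visual_dim x word_dim) matrix by a word vector."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

def visual_objective_b(word_vec, mapping, own_visual, negative_visuals, margin):
    """Max-margin visual term computed on the mapped vector z = mapping * word_vec."""
    z = mat_vec(mapping, word_vec)
    pos = cosine(z, own_visual)
    hinge = sum(max(0.0, margin - pos + cosine(z, neg)) for neg in negative_visuals)
    return -hinge

def l2_penalty(mapping, lam):
    """Assumed regularizer: lam times the squared Frobenius norm of the mapping."""
    return lam * sum(m * m for row in mapping for m in row)
```

Because the mapping need not be square, word and visual vectors can have different dimensionalities; with a square identity mapping the function reduces to the Model A objective.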


4 Experimental Setup 


Our text corpus is a Wikipedia 2009 dump comprising approximately 800M tokens.^1 To train the multimodal models, we add visual information for 5,100 words that have an entry in ImageNet (Deng et al., 2009), occur at least 500 times in the corpus and have concreteness score ≥ 0.5 according to Turney et al. (2011). On average, about 5% of the tokens in the text corpus are associated to a visual representation. To construct the visual representation of a word, we sample 100 pictures from its ImageNet entry, and extract a 4096-dimensional vector from each picture using the Caffe toolkit (Jia et al., 2014), together with the pre-trained convolutional neural network of Krizhevsky et al. (2012). The vector corresponds to activation in the top (FC7) layer of the network. Finally, we average the vectors of the 100 pictures associated to each word, deriving 5,100 aggregated visual representations.
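The word selection and aggregation steps just described can be sketched as below. The function and argument names are hypothetical, and the CNN feature extraction itself is assumed to have been run beforehand, so the sketch only covers the filtering criteria and the averaging of per-image vectors.

```python
def eligible_for_visual_training(word, corpus_counts, concreteness, imagenet_words):
    """Apply the three selection criteria: ImageNet entry, frequency, concreteness."""
    return (word in imagenet_words
            and corpus_counts.get(word, 0) >= 500
            and concreteness.get(word, 0.0) >= 0.5)

def aggregate_visual_vector(image_vectors):
    """Average per-image feature vectors into one visual vector for a word."""
    if not image_vectors:
        raise ValueError("need at least one image vector")
    dim = len(image_vectors[0])
    n = len(image_vectors)
    return [sum(vec[i] for vec in image_vectors) / n for i in range(dim)]
```

In the setup above, `image_vectors` would hold the 100 sampled 4096-dimensional activations of a word.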


Hyperparameters. For both SKIP-GRAM and the MMSKIP-GRAM models, we fix hidden layer size to 300. To facilitate comparison between MMSKIP-GRAM-A and MMSKIP-GRAM-B, and since the former requires equal linguistic and visual dimensionality, we keep the first 300 dimensions of the visual vectors. For the linguistic objective, we use hierarchical softmax with a Huffman frequency-based encoding tree, setting frequency subsampling option t = 0.001 and window size c = 5, without tuning. The following hyperparameters were tuned on the text9 corpus:^2 MMSKIP-GRAM-A: k=20, γ=0.5; MMSKIP-GRAM-B: k=5, γ=0.5, λ=0.0001.


5 Experiments 

5.1 Approximating human judgments 

Benchmarks. A widely adopted way to test DSMs and their multimodal extensions is to measure how well model-generated scores approximate human similarity judgments about pairs of words. We put together various benchmarks covering diverse aspects of meaning, to gain insights on the effect of perceptual information on different similarity facets. Specifically, we test on general relatedness (MEN, Bruni et al. (2014), 3K pairs), e.g., pickles are related to hamburgers, semantic (≈ taxonomic) similarity (Simlex-999, Hill et al. (2014), 1K pairs; SemSim, Silberer and Lapata (2014), 7.5K pairs), e.g., pickles are similar to onions, as well as visual similarity (VisSim, Silberer and Lapata (2014), same pairs as SemSim with different human ratings), e.g., pickles look like zucchinis.

The parameters of all models are estimated by backpropagation of error via stochastic gradient descent.

^1 http://wacky.sslmit.unibo.it
^2 http://mattmahoney.net/dc/textdata.html

Alternative Multimodal Models. We compare our models against several recent alternatives. We test the vectors made available by Kiela and Bottou (2014). Similarly to us, they derive textual features with the skip-gram model (from a portion of the Wikipedia and the British National Corpus) and use visual representations extracted from the ESP data set (von Ahn and Dabbish, 2004) through a convolutional neural network (Oquab et al., 2014). They concatenate textual and visual features after normalizing to unit length and centering to zero mean. We also test the vectors that performed best in the evaluation of Bruni et al. (2014), based on textual features extracted from a 3B-token corpus and SIFT-based Bag-of-Visual-Words visual features (Sivic and Zisserman, 2003) extracted from the ESP collection. Bruni and colleagues fuse a weighted concatenation of the two components through SVD. We further re-implement both methods with our own textual and visual embeddings as CONCATENATION and SVD (with target dimensionality 300, picked without tuning). Finally, we present for comparison the results on SemSim and VisSim reported by Silberer and Lapata (2014), obtained with a stacked-autoencoders architecture run on textual features extracted from Wikipedia with the Strudel algorithm (Baroni et al., 2010) and attribute-based visual features (Farhadi et al., 2009) extracted from ImageNet.

All benchmarks contain a fair amount of words for which we did not use direct visual evidence. We are interested in assessing the models both in terms of how they fuse linguistic and visual evidence when they are both available, and for their robustness in lack of full visual coverage. We thus evaluate them in two settings. The visual-coverage columns of Table 1 (those on the right) report results on the subsets for which all compared models have access to direct visual information for both words. We further report results on the full sets (“100%” columns of Table 1) for models that can propagate visual information and that, consequently, can meaningfully be tested on words without direct visual representations.

Results. The state-of-the-art visual CNN FEATURES alone perform remarkably well, outperforming the purely textual model (SKIP-GRAM) in two tasks, and achieving the best absolute performance on the visual-coverage subset of Simlex-999. Regarding multimodal fusion (that is, focusing on the visual-coverage subsets), both MMSKIP-GRAM models perform very well, at the top or just below it on all tasks, with comparable results for the two variants. Their performance is also good on the full data sets, where they consistently outperform SKIP-GRAM and SVD (which is much more strongly affected by lack of complete visual information). They are just a few points below the state-of-the-art MEN correlation (0.8), achieved by Baroni et al. (2014) with a corpus three times larger than ours and extensive tuning. MMSKIP-GRAM-B is close to the state of the art for Simlex-999, reported by the resource creators to be at 0.41 (Hill et al., 2014). Most impressively, MMSKIP-GRAM-A reaches the performance level of the Silberer and Lapata (2014) model on their SemSim and VisSim data sets, despite the fact that the latter has full visual-data coverage and uses attribute-based image representations, requiring supervised learning of attribute classifiers, which achieve performance in the semantic tasks comparable to or higher than that of our CNN features (see Table 3 in Silberer and Lapata (2014)). Finally, if the multimodal models (unsurprisingly) bring about a large performance gain over the purely linguistic model on visual similarity, the improvement is consistently large also for the other benchmarks, confirming that multimodality leads to better semantic models in general, which can help in capturing different types of similarity (general relatedness, strictly taxonomic, perceptual).

While we defer to further work a better understanding of the relation between multimodal grounding and different similarity relations, Table 2 provides qualitative insights on how injecting visual information changes the structure of semantic space. The top SKIP-GRAM neighbours of donuts are places where you might encounter them, whereas the multimodal models relate them to other take-away food, ranking visually-similar pizzas at the top. The owl example shows how multimodal models pick taxonomically closer neighbours of concrete objects, since often closely related things also look similar (Bruni et al., 2014). In particular, both multimodal models get rid of squirrels and offer other birds of prey as nearest neighbours. No direct visual evidence was used to induce the embeddings of the remaining words in the table, which are thus influenced by vision only by propagation. The subtler but systematic changes we observe in such cases suggest that this indirect propagation is not only non-damaging with respect to purely linguistic representations, but actually beneficial. For the concrete mural concept, both multimodal models rank paintings and portraits above less closely related sculptures (they are not a form of painting). For tobacco, both models rank cigarettes and cigars over coffee, and MMSKIP-GRAM-B avoids the arguably less common “crop” sense cued by corn. The last two examples show how the multimodal models turn up the embodiment level in their representation of abstract words. For depth, their neighbours suggest a concrete marine setup over the more abstract measurement sense picked by the SKIP-GRAM neighbours. For chaos, they rank a demon, that is, a concrete agent of chaos, at the top, and replace the more abstract notion of despair with equally gloomy but more imageable shadows and destruction (more on abstract words below).

Model                  MEN            Simlex-999     SemSim         VisSim
                       100%   42%     100%   29%     100%   85%     100%   85%

KIELA AND BOTTOU       -      0.74    -      0.33    -      0.60    -      0.50
BRUNI ET AL.           -      0.77    -      0.44    -      0.69    -      0.56
SILBERER AND LAPATA    -      -       -      -       0.70   -       0.64   -

CNN FEATURES           -      0.62    -      0.54    -      0.55    -      0.56
SKIP-GRAM              0.70   0.68    0.33   0.29    0.62   0.62    0.48   0.48

CONCATENATION          -      0.74    -      0.46    -      0.68    -      0.60
SVD                    0.61   0.74    0.28   0.46    0.65   0.68    0.58   0.60
MMSKIP-GRAM-A          0.75   0.74    0.37   0.50    0.72   0.72    0.63   0.63
MMSKIP-GRAM-B          0.74   0.76    0.40   0.53    0.66   0.68    0.60   0.60

Table 1: Spearman correlation between model-generated similarities and human judgments. Right columns report correlation on visual-coverage subsets (percentage of original benchmark covered by subsets on first row of respective columns). First block reports results for out-of-the-box models; second block for visual and textual representations alone; third block for our implementation of multimodal models.

Target     SKIP-GRAM                         MMSKIP-GRAM-A                   MMSKIP-GRAM-B

donut      fridge, diner, candy              pizza, sushi, sandwich          pizza, sushi, sandwich
owl        pheasant, woodpecker, squirrel    eagle, woodpecker, falcon       eagle, falcon, hawk

mural      sculpture, painting, portrait     painting, portrait, sculpture   painting, portrait, sculpture
tobacco    coffee, cigarette, corn           cigarette, cigar, corn          cigarette, cigar, smoking
depth      size, bottom, meter               sea, underwater, level          sea, size, underwater
chaos      anarchy, despair, demon           demon, anarchy, destruction     demon, anarchy, shadow

Table 2: Ordered top 3 neighbours of example words in purely textual and multimodal spaces. Only donut and owl were trained with direct visual information.
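For readers who want to reproduce the kind of figure reported in Table 1, the following is a minimal sketch of a Spearman rank correlation between model scores and human ratings, written in plain Python. It is an illustration rather than the evaluation script used here; in particular, how the model similarity for a word pair is computed (for instance, by cosine) is left to the caller, and the function names are hypothetical.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks

def spearman(model_scores, human_scores):
    """Pearson correlation of the two rank vectors (undefined if either is constant)."""
    rx, ry = average_ranks(model_scores), average_ranks(human_scores)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

Because only ranks matter, the measure is insensitive to the different scales of model similarities and human rating schemes across benchmarks.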

5.2 Zero-shot image labeling and retrieval 

The multimodal representations induced by our models should be better suited than purely text-based vectors to label or retrieve images. In particular, given that the quantitative and qualitative results collected so far suggest that the models propagate visual information across words, we apply them to image labeling and retrieval in the challenging zero-shot setup (see Section 2 above).^3

^3 We will refer here, for conciseness’ sake, to image labeling/retrieval, but, as our visual vectors are aggregated representations of images, the tasks we are modeling consist, more precisely, in labeling a set of pictures denoting the same object and retrieving the corresponding set given the name of the object.


















Setup. We take out as test set 25% of the 5.1K words we have visual vectors for. The multimodal models are re-trained without visual vectors for these words, using the same hyperparameters as above. For both tasks, the search for the correct word label/image is conducted on the whole set of 5.1K word/visual vectors.

In the image labeling task, given a visual vector representing an image, we map it onto word space, and label the image with the word corresponding to the nearest vector. To perform the vision-to-language mapping, we train a Ridge regression by 5-fold cross-validation on the test set (for SKIP-GRAM only, we also add the remaining 75% of word-image vector pairs used in estimating the multimodal models to the Ridge training data).^4

In the image retrieval task, given a linguistic/multimodal vector, we map it onto visual space, and retrieve the nearest image. For SKIP-GRAM, we use Ridge regression with the same training regime as for the labeling task. For the multimodal models, since maximizing similarity to visual representations is already part of their training objective, we do not fit an extra mapping function. For MMSKIP-GRAM-A, we directly look for nearest neighbours of the learned embeddings in visual space. For MMSKIP-GRAM-B, we use the $M^{u \to v}$ mapping function induced while learning word embeddings.
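The nearest-neighbour search and the precision@k measure used in both tasks can be sketched as follows. This is an illustrative sketch under two assumptions: similarity is measured by cosine, and precision@k is read as the percentage of test items whose gold label appears among the top k candidates. The function names and the dictionary layout are hypothetical, and the query vectors are assumed to be already mapped into the candidate space.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rank_candidates(query_vec, candidates):
    """Candidate labels sorted by decreasing cosine to the query vector."""
    return sorted(candidates,
                  key=lambda label: cosine(query_vec, candidates[label]),
                  reverse=True)

def precision_at_k(queries, candidates, k):
    """Percentage of (gold_label, vector) queries with the gold label in the top k."""
    hits = sum(1 for gold, vec in queries
               if gold in rank_candidates(vec, candidates)[:k])
    return 100.0 * hits / len(queries)
```

For labeling, `candidates` would map words to word vectors and each query would be a mapped image vector; for retrieval the roles are reversed.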



Results. In image labeling (Table 3), SKIP-GRAM is outperformed by both multimodal models, confirming that these models produce vectors that are directly applicable to vision tasks thanks to visual propagation. The most interesting results, however, are achieved in image retrieval (Table 4), which is essentially the task the multimodal models have been implicitly optimized for, so that they could be applied to it without any specific training. The strategy of directly querying for the nearest visual vectors of the MMSKIP-GRAM-A word embeddings works remarkably well, outperforming at the higher ranks SKIP-GRAM, which requires an ad-hoc mapping function. This suggests that the multimodal embeddings we are inducing, while general enough to achieve good performance in the semantic tasks discussed above, encode sufficient visual information for direct application to image analysis tasks. This is especially remarkable because the word vectors we are testing were not matched with visual representations at model training time, and are thus multimodal only by propagation. The best performance is achieved by MMSKIP-GRAM-B, confirming our claim that its $M^{u \to v}$ matrix acts as a multimodal mapping function.

                 P@1    P@2    P@10   P@20   P@50
SKIP-GRAM        1.5    2.6    14.2   23.5   36.1
MMSKIP-GRAM-A    2.1    3.7    16.7   24.6   37.6
MMSKIP-GRAM-B    2.2    5.1    20.2   28.5   43.5

Table 3: Percentage precision@k results in the zero-shot image labeling task.

                 P@1    P@2    P@10   P@20   P@50
SKIP-GRAM        1.9    3.3    11.5   18.5   30.4
MMSKIP-GRAM-A    1.9    3.2    13.9   20.2   33.6
MMSKIP-GRAM-B    1.9    3.8    13.2   22.5   38.3

Table 4: Percentage precision@k results in the zero-shot image retrieval task.

^4 We use one fold to tune Ridge λ, three to estimate the mapping matrix and test in the last fold. To enforce strict zero-shot conditions, we exclude from the test fold labels occurring in the LSVRC2012 set that was employed to train the CNN of Krizhevsky et al. (2012), that we use to extract visual features.

5.3 Abstract words


We have already seen, through the depth and chaos examples of Table 2, that the indirect influence of visual information has interesting effects on the representation of abstract terms. The latter have received little attention in multimodal semantics, with Hill and Korhonen (2014) concluding that abstract nouns, in particular, do not benefit from propagated perceptual information, and their representation is even harmed when such information is forced on them (see Figure 4 of their paper). Still, embodied theories of cognition have provided considerable evidence that abstract concepts are also grounded in the senses (Barsalou, 2008; Lakoff and Johnson, 1999). Since the word representations produced by MMSKIP-GRAM-A, including those pertaining to abstract concepts, can be directly used to search for near images in visual space, we decided to verify, experimentally, if these near images (of concrete things) are relevant not only for concrete words, as expected, but also for abstract ones, as predicted by embodied views of meaning.

More precisely, we focused on the set of 200 words that were sampled across the USF norms concreteness spectrum by Kiela et al. (2014) (2 words had to be excluded for technical reasons). This set includes not only concrete (meat) and abstract (thought) nouns, but also adjectives (boring), verbs (teach), and even grammatical terms (how). Some words in the set have relatively high concreteness ratings, but are not particularly imageable, e.g.: hot, smell, pain, sweet. For each word in the set, we extracted the nearest neighbour picture of its MMSKIP-GRAM-A representation, and matched it with a random picture. The pictures were selected from a set of 5,100, all labeled with distinct words (the picture set includes, for each of the words associated to visual information as described in Section 4, the nearest picture to its aggregated visual representation). Since it is much more common for concrete than abstract words to be directly represented by an image in the picture set, when searching for the nearest neighbour we excluded the picture labeled with the word of interest, if present (e.g., we excluded the picture labeled tree when picking the nearest neighbour of the word tree). We ran a CrowdFlower^5 survey in which we presented each test word with the two associated images (randomizing presentation order of nearest and random picture), and asked subjects which of the two pictures they found more closely related to the word. We collected at least 20 judgments per word. Subjects showed large agreement (median proportion of majority choice at 90%), confirming that they understood the task and behaved consistently.

We quantify performance in terms of proportion 
of words for which the number of votes for the near¬ 
est neighbour picture is significantly above chance 
according to a two-tailed binomial test. We set sig¬ 
nificance af p<0.05 affer adjusfing all p-values wifh 
fhe Holm correction for running 198 sfafisfical fesfs. 
The resulfs in Table indicate fhaf, in abouf half 
fhe cases, fhe nearesf picfure fo a word MMSkip- 
GRAM-A represenfafion is meaningfully related fo 
fhe word. As expecfed, fhis is more often fhe case for 
concrete fhan absfracf words. Sfill, we also observe a 


⁵ http://www.crowdflower.com



           global  |words|  unseen  |words|
all         48%     198      30%     127
concrete    73%      99      53%      30
abstract    23%      99      23%      97


Table 5: Subjects' preference for nearest visual
neighbour of words in Kiela et al. (2014) vs. random
pictures. Figure of merit is percentage proportion
of significant results in favor of nearest neighbour
across words. Results are reported for the whole set,
as well as for words above (concrete) and below
(abstract) the concreteness rating median. The unseen
column reports results when words exposed to direct
visual evidence during training are discarded. The
|words| columns report set cardinality.


[Figure 2 image panels; surviving panel labels:
freedom, theory, wrong, together, place]

Figure 2: Examples of nearest visual neighbours of
some abstract words: on the left, cases where
subjects preferred the neighbour to the random foil; on
the right, cases where they did not.


significant preference for the model-predicted nearest
picture for about one fourth of the abstract terms.
Whether a word was exposed to direct visual evidence
during training of course makes a big difference, and
this factor interacts with concreteness, as only two
abstract words were matched with images during
training.⁶ When we limit evaluation to word
representations that were not exposed to pictures
during training, the difference between concrete and
abstract terms, while still large, becomes less
dramatic than if all words are considered.
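
The significance procedure described above (a
two-tailed binomial test per word, followed by the Holm
correction over all words) can be sketched as follows.
This is an illustration under stated assumptions, not
the code used for the reported results: the function
names are hypothetical, chance level is taken to be
0.5, and the vote counts in the usage lines are made
up.

```python
from math import comb

def binom_two_tailed(k, n, p=0.5):
    """Exact two-tailed binomial p-value for k successes in n trials.

    Sums the probability of every outcome that is no more likely
    than the observed one.
    """
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-7)  # tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= threshold))

def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * pvalues[idx])
        running_max = max(running_max, candidate)  # keep monotonic
        adjusted[idx] = running_max
    return adjusted

def proportion_significant(votes, alpha=0.05):
    """votes: one (votes_for_neighbour, total_votes) pair per word.

    A word counts only if its adjusted p-value is below alpha AND
    the neighbour picture received more than half of the votes.
    """
    raw = [binom_two_tailed(k, n) for k, n in votes]
    adj = holm_adjust(raw)
    hits = sum(1 for (k, n), q in zip(votes, adj)
               if q < alpha and 2 * k > n)
    return hits / len(votes)

# Made-up counts for three words, 20 judgments each.
print(proportion_significant([(20, 20), (10, 20), (0, 20)]))
```

Because the test is two-tailed, a word whose random
picture is strongly preferred also yields a small
p-value; the direction check in proportion_significant
keeps such words out of the count.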

⁶ In both cases, the images actually depict concrete
senses of the words: a memory board for memory and a
stop sign for stop.

Figure 2 shows four cases in which subjects expressed
a strong preference for the nearest visual neighbour
of a word. Freedom, god and theory are strikingly in
agreement with the view, from embodied theories, that
abstract words are grounded in relevant concrete
scenes and situations. The together example
illustrates how visual data might ground abstract
notions in surprising ways. For all these cases, we
can borrow what Howell et al. (2005) say about visual
propagation to abstract words (p. 260):


Intuitively, this is something like trying to explain 
an abstract concept like love to a child by using 
concrete examples of scenes or situations that are 
associated with love. The abstract concept is never 
fully grounded in external reality, but it does inherit 
some meaning from the more concrete concepts to 
which it is related. 


Of course, not all examples are good: the last column
of Figure 2 shows cases with no obvious relation
between words and visual neighbours (subjects
preferred the random images by a large margin).

The multimodal vectors we induce also display an
interesting intrinsic property related to the
hypothesis that grounded representations of abstract
words are more complex than those of concrete ones,
since abstract concepts relate to varied and composite
situations (Barsalou and Wiemer-Hastings, 2005). A
natural corollary of this idea is that
visually-grounded representations of abstract concepts
should be more diverse: If you think of dogs, very
similar images of specific dogs will come to mind. You
can also imagine the abstract notion of freedom, but
the nature of the related imagery will be much more
varied. Recently, Kiela et al. (2014) have proposed to
measure abstractness by exploiting this very same
intuition. However, they rely on manual annotation of
pictures via Google Images and define an ad-hoc measure
of image dispersion. We conjecture that the
representations naturally induced by our models display
a similar property. In particular, the entropy of our
multimodal vectors, being an expression of how varied
the information they encode is, should correlate with
the degree of abstractness of the corresponding words.
As Figure 3(a) shows, there is indeed a difference in
entropy between the most concrete (meat) and most
abstract (hope) words in the Kiela et al. set.

To test the hypothesis quantitatively, we measure the
correlation of entropy and concreteness on the 200
words in the Kiela et al. (2014) set.⁷
Figure 3(b) shows that the entropies of both the


⁷ Since the vector dimensions range over the real
number line, we calculate entropy on vectors that are
unit-normed after adding a small constant ensuring all
values are positive.



Model              ρ
Word frequency     0.22
Kiela et al.      -0.65
SKIP-GRAM          0.05
MMSKIP-GRAM-B      0.04
MMSKIP-GRAM-A     -0.75
MMSKIP-GRAM-B*    -0.71

(b)


Figure 3: (a) Distribution of MMSKIP-GRAM-A vector
activation for meat (blue) and hope (red). (b)
Spearman ρ between concreteness and various measures
on the Kiela et al. (2014) set.


MMSKIP-GRAM-A representations and those generated by
mapping MMSKIP-GRAM-B vectors onto visual space
(MMSKIP-GRAM-B*) achieve very high correlation (but,
interestingly, not MMSKIP-GRAM-B). This is further
evidence that multimodal learning is grounding the
representations of both concrete and abstract words in
meaningful ways.
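
The entropy-based measure can be sketched as follows.
The exact constant and normalisation are not spelled
out in the text, so this sketch makes two loudly
labeled assumptions: each vector is shifted so that its
smallest entry equals a small eps, and "unit-normed" is
read as L1 normalisation, which turns the vector into a
probability distribution. Entropy is in nats, and all
function names are hypothetical.

```python
import math

def vector_entropy(vec, eps=1e-6):
    """Shannon entropy (nats) of a real-valued vector."""
    shift = eps - min(vec)            # assumption: smallest entry becomes eps
    shifted = [x + shift for x in vec]
    total = sum(shifted)              # assumption: L1 normalisation
    probs = [x / total for x in shifted]
    return -sum(p * math.log(p) for p in probs)

def average_ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Made-up vectors: a peaked one and a more spread-out one.
peaked = [0.0, 0.0, 0.0, 10.0]
spread = [1.0, 2.0, 3.0, 4.0]
print(vector_entropy(peaked) < vector_entropy(spread))  # True
```

Under this reading, a negative Spearman value between
entropy and concreteness ratings, as in Figure 3(b),
means that more abstract words tend to have
higher-entropy vectors.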

6 Conclusion 

We introduced two multimodal extensions of SKIP-GRAM.
MMSKIP-GRAM-A is trained by directly optimizing the
similarity of words with their visual representations,
thus forcing maximum interaction between the two
modalities. MMSKIP-GRAM-B includes an extra mediating
layer, acting as a cross-modal mapping component. The
ability of the models to integrate and propagate visual
information resulted in word representations that
performed well in both semantic and vision tasks, and
could be used as input in systems benefiting from prior
visual knowledge (e.g., caption generation). Our
results with abstract words suggest the models might
also help in tasks such as metaphor detection, or even
retrieving/generating pictures of abstract concepts.
Their incremental nature makes them well-suited for
cognitive simulations of grounded language acquisition,
an avenue of research we plan to explore further.